Efficient Pairwise Document Similarity Computation in Big Datasets
Similar resources
An Efficient Document Indexing-Based Similarity Search in Large Datasets
In this paper, we propose a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using an inverted index in similarity search with MapReduce and then propose a simple yet efficient redundancy-free MapReduce scheme, which not only has advantages over the baseline inverted index-based procedures but...
Investigating Measures for Pairwise Document Similarity
The need for a more effective similarity measure is growing as a result of the astonishing amount of information being placed online. Most existing similarity measures are defined by empirically derived formulas and cannot easily be extended to new applications. We present a pairwise document similarity measure based on Information Theory, and present corpus dependent and independent applicatio...
Pairwise Document Similarity in Large Collections with MapReduce
This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a colle...
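The decomposition this abstract describes can be illustrated with a minimal in-memory sketch. It is not the paper's implementation and does not use a real MapReduce runtime: the function names, the toy term weights, and the use of plain Python dictionaries are all assumptions made for illustration. An index maps each term to the documents containing it, a multiplication stage emits one partial product per document pair per shared term, and a summation stage adds the partial products for each pair to give the inner product.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing: map each term to a list of (doc_id, weight) postings."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings


def map_products(postings):
    """Multiplication stage: emit one partial product per document pair
    for every term the two documents share."""
    for plist in postings.values():
        # Sorting keeps each pair key in a canonical (smaller, larger) order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(partials):
    """Summation stage: add up the partial products for each document pair."""
    scores = defaultdict(float)
    for pair, product in partials:
        scores[pair] += product
    return dict(scores)


def pairwise_similarity(docs):
    """Inner product of term-weight vectors for every pair sharing a term."""
    return reduce_sums(map_products(build_postings(docs)))


# Hypothetical toy collection: doc_id -> {term: weight}.
docs = {
    "d1": {"map": 1.0, "reduce": 2.0},
    "d2": {"map": 3.0, "index": 1.0},
    "d3": {"reduce": 1.0, "index": 2.0},
}
scores = pairwise_similarity(docs)
```

Pairs of documents that share no term never appear in the output, which is what makes the index-driven formulation attractive for sparse collections.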
Efficient Graph-Based Document Similarity
Assessing the relatedness of documents is at the core of many applications such as document retrieval and recommendation. Most similarity approaches operate on word-distribution-based document representations, which are fast to compute but problematic when documents differ in language, vocabulary or type, and which neglect the rich relational knowledge available in Knowledge Graphs. In contrast, graph-based...
Efficient structural similarity computation between XML documents
This work describes a new approach for calculating the structural similarity of XML documents. In practice, the majority of existing work on XML document clustering considers the tree structures of these documents as mere vectors and, therefore, does not take into account their hierarchical contexts. Furthermore, in order to calculate the structural similarity o...
Journal
Journal title: International Journal of Database Theory and Application
Year: 2015
ISSN: 2005-4270
DOI: 10.14257/ijdta.2015.8.4.07